Finding biomedical categories in Medline®

Authors

  • Lana Yeganova
  • Won Kim
  • Donald C. Comeau
  • W. John Wilbur
Abstract

BACKGROUND There are several human-defined ontologies relevant to Medline. However, Medline is a fast-growing collection of biomedical documents, which makes it difficult to update and expand these ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories.

RESULTS We study and compare these two alternative sets of terms to identify semantic categories in Medline. We find that both approaches produce reasonable terms as potential categories. We also find significant agreement between the two sets of terms. This overlap increases our confidence in the categories predicted by the two independent methods.

CONCLUSIONS This study is an initial attempt to extract categories that are discussed in Medline. Rather than imposing external ontologies on Medline, our methods allow categories to emerge from the text.
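The two approaches named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tagged sentences are invented, the frequency threshold is arbitrary, and the single hand-written "X such as Y" pattern stands in for the patterns that the paper learns with an alignment-based technique. All function and variable names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each token is "word/TAG" (Penn Treebank style tags).
TAGGED = [
    "proteins/NNS such/JJ as/IN insulin/NN regulate/VBP glucose/NN",
    "proteins/NNS such/JJ as/IN albumin/NN bind/VBP drugs/NNS",
    "drugs/NNS such/JJ as/IN aspirin/NN reduce/VBP pain/NN",
    "glucose/NN levels/NNS rise/VBP",
]


def parse(tagged_sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in tagged_sentence.split()]


def frequent_nouns(tagged_sentences, min_count=2):
    """Method 1 (sketch): keep nouns whose corpus frequency reaches min_count."""
    counts = Counter(
        word.lower()
        for sentence in tagged_sentences
        for word, tag in parse(sentence)
        if tag.startswith("NN")  # NN, NNS, NNP, NNPS
    )
    return {word for word, count in counts.items() if count >= min_count}


# Assumption: one fixed hyponymy pattern instead of learned, aligned patterns.
SUCH_AS = re.compile(r"\b(\w+) such as (\w+)")


def frequent_hypernyms(plain_sentences, min_count=2):
    """Method 2 (sketch): count the hypernym slot X in 'X such as Y'."""
    counts = Counter(
        match.group(1).lower()
        for sentence in plain_sentences
        for match in SUCH_AS.finditer(sentence)
    )
    return {word for word, count in counts.items() if count >= min_count}


# Plain-text view of the same corpus, used by the pattern matcher.
SENTENCES = [" ".join(word for word, _ in parse(s)) for s in TAGGED]

nouns = frequent_nouns(TAGGED)
hypernyms = frequent_hypernyms(SENTENCES)
overlap = nouns & hypernyms  # terms proposed by both methods
```

Terms that appear in both sets correspond to the agreement between methods that the abstract reports as raising confidence in a candidate category.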


Similar resources

Update on Safety and Efficacy of HPV Vaccines: Focus on Gardasil

The human papillomavirus (HPV) is a highly contagious and prevalent virus that is primarily sexually transmitted. The Gardasil® quadrivalent vaccine, the Cervarix® bivalent vaccine and the Gardasil® 9 nonavalent vaccine were developed to prevent the spread of HPV as well as the incidence of its associated diseases. The aim of this mini-review is to critically analyze the safety and efficacy of b...


Automatically Classifying Sentences in Full-Text Biomedical Articles into Introduction, Methods, Results and Discussion

Biomedical texts can typically be represented by four rhetorical categories: introduction, methods, results and discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied approaches to automatically classify sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences...


Tagging gene and protein names in full text articles

Current information extraction efforts in the biomedical domain tend to focus on finding entities and facts in structured databases or MEDLINE abstracts. We apply a gene and protein name tagger trained on Medline abstracts (ABGene) to a randomly selected set of full text journal articles in the biomedical domain. We show the effect of adaptations made in response to the greater heterogeneity o...


Project Final Report

A vast amount of literature for biomedical research is available online in the MEDLINE database. This gives biomedical scientists instant access to the literature and references they need. But finding a manageable subset of the literature that is relevant to their current research is hard because: (1) the number of these articles is growing very fast, and (2) each disease (and gene) has di...


Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Tw...



Volume: 3

Publication date: 2012